Gapped Local Similarity Search with Provable Guarantees
نویسندگان
چکیده
We present a program qhash, based on q-gram filtration and high-dimensional search, to find gapped local similarities between two sequences. Our approach differs from past q-gram-based approaches in two main aspects. Our filtration step uses algorithms for a sparse all-pairs problem, while past studies use suffix-tree-like structures and counters. Our program works in sequence-sequence mode, while most past ones (except QUASAR) work in pattern-database mode. We leverage existing research in high-dimensional proximity search to discuss sparse all-pairs algorithms, and show them to be subquadratic under certain reasonable input assumptions. Our qhash program has provable sensitivity (even on worst-case inputs) and average-case performance guarantees. It is significantly faster than a fully sensitive dynamicprogramming-based program for strong similarity search on long sequences.
منابع مشابه
Beyond Pairwise: Provably Fast Algorithms for Approximate k-Way Similarity Search
We go beyond the notion of pairwise similarity and look into search problems with k-way similarity functions. In this paper, we focus on problems related to 3-way Jaccard similarity: R = |S1∩S2∩S3| |S1∪S2∪S3| , S1, S2, S3 ∈ C, where C is a size n collection of sets (or binary vectors). We show that approximate R similarity search problems admit fast algorithms with provable guarantees, analogou...
متن کاملLearning and Inference via Maximum Inner Product Search
A large class of commonly used probabilistic models known as log-linear models are defined up to a normalization constant.Typical learning algorithms for such models require solving a sequence of probabilistic inference queries. These inferences are typically intractable, and are a major bottleneck for learning models with large output spaces. In this paper, we provide a new approach for amorti...
متن کامل"Provable ICA with Unknown Gaussian Noise, with Implications for Gaussian Mixtures and Autoencoders"
We present a new algorithm for Independent Component Analysis (ICA) which has provable performance guarantees. In particular, suppose we are given samples of the form y = Ax+ η where A is an unknown n× n matrix and x is a random variable whose components are independent and have a fourth moment strictly less than that of a standard Gaussian random variable and η is an n-dimensional Gaussian ran...
متن کاملSelecting Diverse Features via Spectral Regularization
We study the problem of diverse feature selection in linear regression: selecting a small subset of diverse features that can predict a given objective. Diversity is useful for several reasons such as interpretability, robustness to noise, etc. We propose several spectral regularizers that capture a notion of diversity of features and show that these are all submodular set functions. These regu...
متن کاملRelaxation in text search using taxonomies
In this paper we propose a novel document retrieval model in which text queries are augmented with multi-dimensional taxonomy restrictions. These restrictions may be relaxed at a cost to result quality. This new model may be applicable in many arenas, including multifaceted, product, and local search, where documents are augmented with hierarchical metadata such as topic or location. We present...
متن کامل